Template for test
In [1]:
from pred import Predictor
from pred import sequence_vector
from pred import chemical_vector
Controlling for random negatives vs. no random negatives (sans random) across imbalance techniques, using K acetylation.
Training data is from the CUCKOO group and benchmarks are from dbPTM.
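The logs in this notebook report sensitivity, specificity, accuracy and a ROC value next to the raw TP/FP/TN/FN counts. As a reading aid, here is a minimal sketch of how those figures relate to the counts. `confusion_metrics` is a hypothetical helper and is not part of `pred`. In the logged runs the ROC value coincides with the mean of sensitivity and specificity (balanced accuracy), which is what a ROC AUC reduces to for hard 0/1 predictions; that `pred` computes it this way internally is an assumption.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive summary metrics from confusion-matrix counts (hypothetical helper)."""
    positives = tp + fn
    negatives = tn + fp
    # Guard against empty classes so the function never divides by zero.
    sensitivity = tp / positives if positives else float("nan")
    specificity = tn / negatives if negatives else float("nan")
    accuracy = (tp + tn) / (positives + negatives)
    # For hard 0/1 predictions, ROC AUC equals balanced accuracy.
    roc = (sensitivity + specificity) / 2
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "roc": roc,
    }


# Counts taken from the first "y pass" test run logged in this notebook.
metrics = confusion_metrics(tp=2, fp=7, tn=32152, fn=741)
for name, value in metrics.items():
    print(name, value)
```

Read this way, an accuracy near 0.98 with a ROC near 0.50 indicates a model that predicts almost everything as negative, which is why accuracy alone is a poor guide on this imbalanced data.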
In [2]:
par = ["pass", "ADASYN", "SMOTEENN", "random_under_sample", "ncl", "near_miss"]
for i in par:
    # "y" runs: sequence vector, no random negatives (random_data=0)
    print("y", i)
    y = Predictor()
    y.load_data(file="Data/Training/k_acetylation.csv")
    y.process_data(vector_function="sequence", amino_acid="K", imbalance_function=i, random_data=0)
    y.supervised_training("forest")
    y.benchmark("Data/Benchmarks/acet.csv", "K")
    del y
    # "x" runs: same settings, with random negatives added (random_data=1)
    print("x", i)
    x = Predictor()
    x.load_data(file="Data/Training/k_acetylation.csv")
    x.process_data(vector_function="sequence", amino_acid="K", imbalance_function=i, random_data=1)
    x.supervised_training("forest")
    x.benchmark("Data/Benchmarks/acet.csv", "K")
    del x
y pass
Loading Data
Loaded Data
Working on Data
Finished working with Data
Training Data Points: 296118
Test Data Points: 32902
Starting Training
Done training
Test Results
Sensitivity: 0.0026917900403768506
Specificity : 0.9997823315401598
Accuracy: 0.9772658197070087
ROC 0.50123706079
TP 2 FP 7 TN 32152 FN 741
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.0015556938394523958
Specificity : 0.999834290764472
Accuracy: 0.9184465526863173
ROC 0.500694992302
TP 5 FP 6 TN 36202 FN 3209
None
x pass
Loading Data
Loaded Data
Working on Data
Finished working with Data
Random Sequences Generated 329020
Filtering Random Data
Random Data Added: 329020
Finished with Random Data
Training Data Points: 625138
Test Data Points: 32902
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 32103 FN 799
None
Number of data points in benchmark 39422
Benchmark Results
Failed
TP 0 FP 0 TN 36208 FN 3214
None
y ADASYN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 576608
Test Data Points: 64068
Starting Training
Done training
Test Results
Sensitivity: 0.7362558502340094
Specificity : 0.9512461740271098
Accuracy: 0.8436973215958045
ROC 0.843751012131
TP 23597 FP 1561 TN 30457 FN 8453
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.07871810827629122
Specificity : 0.9511710119310649
Accuracy: 0.8800416011364213
ROC 0.514944560104
TP 253 FP 1768 TN 34440 FN 2961
None
x ADASYN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 329020
Filtering Random Data
Random Data Added: 329020
Finished with Random Data
Training Data Points: 905628
Test Data Points: 64068
Starting Training
Done training
Test Results
Sensitivity: 0.7705241405222619
Specificity : 0.9575474634298163
Accuracy: 0.8643160392083411
ROC 0.864035801976
TP 24609 FP 1364 TN 30766 FN 7329
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.04449284380833852
Specificity : 0.9657534246575342
Accuracy: 0.89064481761453
ROC 0.505123134233
TP 143 FP 1240 TN 34968 FN 3071
None
y SMOTEENN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 504840
Test Data Points: 56094
Starting Training
Done training
Test Results
Sensitivity: 0.7794756413862067
Specificity : 0.6910305897978666
Accuracy: 0.7414875031197633
ROC 0.735253115592
TP 24944 FP 7444 TN 16649 FN 7057
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.3833229620410703
Specificity : 0.6741051701281485
Accuracy: 0.6503982547815941
ROC 0.528714066085
TP 1232 FP 11800 TN 24408 FN 1982
None
x SMOTEENN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 329020
Filtering Random Data
Random Data Added: 329020
Finished with Random Data
Training Data Points: 833981
Test Data Points: 56107
Starting Training
Done training
Test Results
Sensitivity: 0.8815531564045923
Specificity : 0.5688475340065092
Accuracy: 0.7479815352808028
ROC 0.725200345206
TP 28334 FP 10333 TN 13633 FN 3807
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.5087118855009334
Specificity : 0.5549049933716306
Accuracy: 0.5511389579422658
ROC 0.531808439436
TP 1635 FP 16116 TN 20092 FN 1579
None
y random_under_sample
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 14542
Test Data Points: 1616
Starting Training
Done training
Test Results
Sensitivity: 0.5515923566878981
Specificity : 0.5210589651022864
Accuracy: 0.5358910891089109
ROC 0.536325660895
TP 433 FP 398 TN 433 FN 352
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.5248911014312383
Specificity : 0.5439681838267786
Accuracy: 0.5424128659124346
ROC 0.534429642629
TP 1687 FP 16512 TN 19696 FN 1527
None
x random_under_sample
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 329020
Filtering Random Data
Random Data Added: 329020
Finished with Random Data
Training Data Points: 343562
Test Data Points: 1616
Starting Training
Done training
Test Results
Sensitivity: 0.3639618138424821
Specificity : 0.7236503856041131
Accuracy: 0.5371287128712872
ROC 0.543806099723
TP 305 FP 215 TN 563 FN 533
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.3061605476042315
Specificity : 0.7723155103844455
Accuracy: 0.7343107909289229
ROC 0.539238028994
TP 984 FP 8244 TN 27964 FN 2230
None
y ncl
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 283168
Test Data Points: 31464
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 30658 FN 806
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.00031113876789047915
Specificity : 0.9999723817940787
Accuracy: 0.9184719192329156
ROC 0.500141760281
TP 1 FP 1 TN 36207 FN 3213
None
x ncl
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 329020
Filtering Random Data
Random Data Added: 329020
Finished with Random Data
Training Data Points: 612188
Test Data Points: 31464
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 30597 FN 867
None
Number of data points in benchmark 39422
Benchmark Results
Failed
TP 0 FP 0 TN 36208 FN 3214
None
y near_miss
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 14542
Test Data Points: 1616
Starting Training
Done training
Test Results
Sensitivity: 0.8934010152284264
Specificity : 0.7657004830917874
Accuracy: 0.8279702970297029
ROC 0.82955074916
TP 704 FP 194 TN 634 FN 84
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.9194150591163659
Specificity : 0.09403999116217411
Accuracy: 0.1613312363654812
ROC 0.506727525139
TP 2955 FP 32803 TN 3405 FN 259
None
x near_miss
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 329020
Filtering Random Data
Random Data Added: 329020
Finished with Random Data
Training Data Points: 343562
Test Data Points: 1616
Starting Training
Done training
Test Results
Sensitivity: 0.7889447236180904
Specificity : 0.8219512195121951
Accuracy: 0.8056930693069307
ROC 0.805447971565
TP 628 FP 146 TN 674 FN 168
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.8503422526446796
Specificity : 0.16767012814847548
Accuracy: 0.22332707625183906
ROC 0.509006190397
TP 2733 FP 30137 TN 6071 FN 481
None
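The `random_under_sample` runs above train on only 14542 points, which is consistent with the negative class being cut down to roughly the size of the positive class. A minimal sketch of that idea follows, using only the standard library. The function below is a hypothetical stand-in written for illustration; the `pred` module may implement the step differently, for example through a resampling library.

```python
import random


def random_under_sample(features, labels, seed=0):
    """Randomly drop rows so every class is as small as the rarest one.

    Hypothetical illustration, not the implementation used by `pred`.
    """
    rng = random.Random(seed)
    # Group feature rows by their class label.
    by_class = {}
    for row, label in zip(features, labels):
        by_class.setdefault(label, []).append(row)
    # Every class is sampled down to the size of the smallest class.
    n_min = min(len(rows) for rows in by_class.values())
    sampled = []
    for label, rows in by_class.items():
        for row in rng.sample(rows, n_min):
            sampled.append((row, label))
    rng.shuffle(sampled)
    balanced_features = [row for row, _ in sampled]
    balanced_labels = [label for _, label in sampled]
    return balanced_features, balanced_labels


# Toy data: 5 positives and 95 negatives.
demo_features = [[i] for i in range(100)]
demo_labels = [1 if i < 5 else 0 for i in range(100)]
bal_features, bal_labels = random_under_sample(demo_features, demo_labels)
print(len(bal_features), sum(bal_labels))
```

The trade-off is visible in the logs: undersampling discards most of the negative examples, so the classifier sees far less of the negative class than the benchmark contains.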
Chemical Vector
In [3]:
par = ["pass", "ADASYN", "SMOTEENN", "random_under_sample", "ncl", "near_miss"]
for i in par:
    # "y" runs: chemical vector, no random negatives (random_data=0)
    print("y", i)
    y = Predictor()
    y.load_data(file="Data/Training/k_acetylation.csv")
    y.process_data(vector_function="chemical", amino_acid="K", imbalance_function=i, random_data=0)
    y.supervised_training("forest")
    y.benchmark("Data/Benchmarks/acet.csv", "K")
    del y
    # "x" runs: same settings, with random negatives added (random_data=1)
    print("x", i)
    x = Predictor()
    x.load_data(file="Data/Training/k_acetylation.csv")
    x.process_data(vector_function="chemical", amino_acid="K", imbalance_function=i, random_data=1)
    x.supervised_training("forest")
    x.benchmark("Data/Benchmarks/acet.csv", "K")
    del x
y pass
Loading Data
Loaded Data
Working on Data
Finished working with Data
Training Data Points: 296118
Test Data Points: 32902
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 32087 FN 815
None
Number of data points in benchmark 39422
Benchmark Results
Failed
TP 0 FP 0 TN 36208 FN 3214
None
x pass
Loading Data
Loaded Data
Working on Data
Finished working with Data
Random Sequences Generated 329020
Filtering Random Data
Random Data Added: 329020
Finished with Random Data
Training Data Points: 625138
Test Data Points: 32902
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 32071 FN 831
None
Number of data points in benchmark 39422
Benchmark Results
Failed
TP 0 FP 0 TN 36208 FN 3214
None
y ADASYN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 576882
Test Data Points: 64098
Starting Training
Done training
Test Results
Sensitivity: 0.6721301239424308
Specificity : 0.4784045903888733
Accuracy: 0.5752129551624076
ROC 0.575267357166
TP 21529 FP 16726 TN 15341 FN 10502
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.6916614810205352
Specificity : 0.48519664162615994
Accuracy: 0.5020293237278677
ROC 0.588429061323
TP 2223 FP 18640 TN 17568 FN 991
None
x ADASYN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 329020
Filtering Random Data
Random Data Added: 329020
Finished with Random Data
Training Data Points: 905902
Test Data Points: 64098
Starting Training
Done training
Test Results
Sensitivity: 0.4699325590108655
Specificity : 0.6490177736202059
Accuracy: 0.5595338388093232
ROC 0.559475166316
TP 15051 FP 11256 TN 20814 FN 16977
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.4586185438705663
Specificity : 0.6374281926646045
Accuracy: 0.6228501851757902
ROC 0.548023368268
TP 1474 FP 13128 TN 23080 FN 1740
None
y SMOTEENN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 512162
Test Data Points: 56907
Starting Training
Done training
Test Results
Sensitivity: 0.7821850209257293
Specificity : 0.4968861746152919
Accuracy: 0.6574059430298558
ROC 0.639535597771
TP 25044 FP 12522 TN 12367 FN 6974
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.7638456751711263
Specificity : 0.38242929739284137
Accuracy: 0.41352544264623814
ROC 0.573137486282
TP 2455 FP 22361 TN 13847 FN 759
None
x SMOTEENN
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 329020
Filtering Random Data
Random Data Added: 329020
Finished with Random Data
Training Data Points: 841375
Test Data Points: 56929
Starting Training
Done training
Test Results
Sensitivity: 0.6220494479863162
Specificity : 0.6496730443206588
Accuracy: 0.6340705088794815
ROC 0.635861246153
TP 20002 FP 8679 TN 16095 FN 12153
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.5690728064716863
Specificity : 0.5117929739284136
Accuracy: 0.5164628887423266
ROC 0.5404328902
TP 1829 FP 17677 TN 18531 FN 1385
None
y random_under_sample
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 14542
Test Data Points: 1616
Starting Training
Done training
Test Results
Sensitivity: 0.474009900990099
Specificity : 0.6633663366336634
Accuracy: 0.5686881188118812
ROC 0.568688118812
TP 383 FP 272 TN 536 FN 425
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.4925326695706285
Specificity : 0.6494421122403888
Accuracy: 0.6366495865252905
ROC 0.570987390906
TP 1583 FP 12693 TN 23515 FN 1631
None
x random_under_sample
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 329020
Filtering Random Data
Random Data Added: 329020
Finished with Random Data
Training Data Points: 343562
Test Data Points: 1616
Starting Training
Done training
Test Results
Sensitivity: 0.19748427672955976
Specificity : 0.8465286236297198
Accuracy: 0.5272277227722773
ROC 0.52200645018
TP 157 FP 126 TN 695 FN 638
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.21437461107654013
Specificity : 0.8248453380468405
Accuracy: 0.7750748313124651
ROC 0.519609974562
TP 689 FP 6342 TN 29866 FN 2525
None
y ncl
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 282356
Test Data Points: 31373
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 30578 FN 795
None
Number of data points in benchmark 39422
Benchmark Results
Failed
TP 0 FP 0 TN 36208 FN 3214
None
x ncl
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 329020
Filtering Random Data
Random Data Added: 329020
Finished with Random Data
Training Data Points: 611376
Test Data Points: 31373
Starting Training
Done training
Test Results
Failed
TP 0 FP 0 TN 30526 FN 847
None
Number of data points in benchmark 39422
Benchmark Results
Failed
TP 0 FP 0 TN 36208 FN 3214
None
y near_miss
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Training Data Points: 14542
Test Data Points: 1616
Starting Training
Done training
Test Results
Sensitivity: 0.6592317224287485
Specificity : 0.8949320148331273
Accuracy: 0.7772277227722773
ROC 0.777081868631
TP 532 FP 85 TN 724 FN 275
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.6477909147479776
Specificity : 0.24698961555457358
Accuracy: 0.27966617624676576
ROC 0.447390265151
TP 2082 FP 27265 TN 8943 FN 1132
None
x near_miss
Loading Data
Loaded Data
Working on Data
Balancing Data
Balanced Data
Finished working with Data
Random Sequences Generated 329020
Filtering Random Data
Random Data Added: 329020
Finished with Random Data
Training Data Points: 343562
Test Data Points: 1616
Starting Training
Done training
Test Results
Sensitivity: 0.2387332521315469
Specificity : 0.9484276729559749
Accuracy: 0.5878712871287128
ROC 0.593580462544
TP 196 FP 41 TN 754 FN 625
None
Number of data points in benchmark 39422
Benchmark Results
Sensitivity: 0.30367143746110764
Specificity : 0.6641626159964649
Accuracy: 0.6347724620770129
ROC 0.483917026729
TP 976 FP 12160 TN 24048 FN 2238
None
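To compare the two settings at a glance, the benchmark ROC values logged above can be tabulated side by side. The sketch below copies those values, rounded to four decimals, for both vector types. Runs whose benchmark was logged as `Failed` report no ROC and are stored as `None`. The dictionary layout and the `roc_shift` helper are illustrative only and not part of `pred`.

```python
# Benchmark ROC values copied from the logs above, rounded to four decimals.
# Each entry is (without random negatives "y", with random negatives "x");
# None marks a run whose benchmark was logged as "Failed".
benchmark_roc = {
    "sequence": {
        "pass": (0.5007, None),
        "ADASYN": (0.5149, 0.5051),
        "SMOTEENN": (0.5287, 0.5318),
        "random_under_sample": (0.5344, 0.5392),
        "ncl": (0.5001, None),
        "near_miss": (0.5067, 0.5090),
    },
    "chemical": {
        "pass": (None, None),
        "ADASYN": (0.5884, 0.5480),
        "SMOTEENN": (0.5731, 0.5404),
        "random_under_sample": (0.5710, 0.5196),
        "ncl": (None, None),
        "near_miss": (0.4474, 0.4839),
    },
}


def roc_shift(pair):
    """Return ROC(with random) - ROC(without random), or None if a run failed."""
    without_random, with_random = pair
    if without_random is None or with_random is None:
        return None
    return round(with_random - without_random, 4)


shifts = {
    vector: {method: roc_shift(pair) for method, pair in runs.items()}
    for vector, runs in benchmark_roc.items()
}

for vector, per_method in shifts.items():
    for method, shift in per_method.items():
        label = "n/a" if shift is None else f"{shift:+.4f}"
        print(f"{vector:9s} {method:20s} {label}")
```

A positive shift means the run with random negatives scored a higher benchmark ROC than the run without them; all values stay close to 0.5 either way.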
Content source: vzg100/Post-Translational-Modification-Prediction